In this notebook, I describe the development of a superencoder algorithm for identifying outliers.

superencoder functions

First, I wrote a set of functions that let me apply autoencoding and supervised encoding (which I call superencoding) to data. Before describing the functions, I’ll jump into the results and then come back and explain the functions. This notebook will be edited!

questions

There are three questions to answer: (1) Are superencoders better than autoencoders on clinical observation data? (2) What is the minimum viable sample size for superencoders? (3) What is the minimum viable number of layers and neurons for the superencoder CNN?

methods

Before drawing any conclusion on whether superencoders perform better than autoencoders on clinical observation data, I first try to answer questions 2 and 3.

To produce indices for comparison, I manually identified minimum and maximum reference ranges for 28 laboratory tests (based on LOINC codes and the i2b2 ontology) and assigned biologically implausible values (BIVs) for each. The list of lab tests and their respective values can be found in the ref_ranges.R script that I’ve sourced above.
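As a rough illustration of how such a reference table can be turned into ground-truth labels, here is a minimal sketch. The table layout, the column names (`bio_min`, `bio_max`), the test codes, and the limits are all hypothetical placeholders, not the contents of ref_ranges.R, and `flag_biv` is a name I made up for this example.

```r
# Hypothetical reference table: placeholder codes and limits,
# NOT the values stored in ref_ranges.R.
ref_ranges <- data.frame(
  test    = c("test_A", "test_B"),
  bio_min = c(0, 10),     # assumed lower plausibility limit
  bio_max = c(100, 500),  # assumed upper plausibility limit
  stringsAsFactors = FALSE
)

# Toy observations to label
obs <- data.frame(
  test  = c("test_A", "test_A", "test_B", "test_B"),
  value = c(42, 250, 5, 120),
  stringsAsFactors = FALSE
)

# Flag a value as biologically implausible (BIV) when it falls
# outside the limits of its own lab test.
flag_biv <- function(obs, ref_ranges) {
  idx <- match(obs$test, ref_ranges$test)
  obs$biv <- obs$value < ref_ranges$bio_min[idx] |
    obs$value > ref_ranges$bio_max[idx]
  obs
}

labelled <- flag_biv(obs, ref_ranges)
print(labelled)
```

The resulting `biv` column plays the role of the truth against which an encoder's outlier flags can be compared.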

The functions calculate sensitivity, specificity, accuracy, and precision (along with several other metrics) for a superencoder/autoencoder with a given sample size (only for the superencoder), CNN specification, and margin of error. I’ll describe these later on…
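For reference, the four metrics named above follow from the confusion counts in the usual way. The sketch below is not the notebook's own function; `classification_metrics` is a hypothetical helper, and it assumes `truth` marks the known BIVs and `predicted` marks the values an encoder flagged as outliers.

```r
# Standard definitions from the confusion counts.
# truth:     TRUE where a value is a known BIV
# predicted: TRUE where the encoder flagged the value as an outlier
classification_metrics <- function(truth, predicted) {
  tp <- sum(truth & predicted)    # true positives
  fp <- sum(!truth & predicted)   # false positives
  tn <- sum(!truth & !predicted)  # true negatives
  fn <- sum(truth & !predicted)   # false negatives
  c(
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    accuracy    = (tp + tn) / (tp + fp + tn + fn),
    precision   = tp / (tp + fp)
  )
}

# Toy labels: 3 true outliers among 8 values
truth     <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
predicted <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)

m <- classification_metrics(truth, predicted)
print(round(m, 3))
```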

To answer both questions 2 and 3, I ran a series of simulations on real-world clinical observation data. Unfortunately, the data are not publicly available, but I can share the results! Simulation results are stored in a directory from which I read in all the results.
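Reading a directory of result files into one table can look roughly like this. The sketch assumes one CSV file per simulation run; the directory name, file names, column names, and numbers are invented so that the example runs on its own, and the real results may be stored differently.

```r
# Hypothetical layout: one CSV per simulation run in a results directory.
results_dir <- file.path(tempdir(), "sim_results")
dir.create(results_dir, showWarnings = FALSE)

# Two made-up result files so the sketch is self-contained
write.csv(
  data.frame(test = "test_A", sample_size = 500, sensitivity = 0.80),
  file.path(results_dir, "run_1.csv"), row.names = FALSE
)
write.csv(
  data.frame(test = "test_A", sample_size = 1000, sensitivity = 0.85),
  file.path(results_dir, "run_2.csv"), row.names = FALSE
)

# Read every CSV in the directory and stack the rows into one data frame
files   <- list.files(results_dir, pattern = "\\.csv$", full.names = TRUE)
results <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
print(results)
```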

I have run more than 100 simulations on the 28 laboratory tests, with 3 different sample sizes (500, 1,000, and 10,000) and 5 CNN layer and neuron specifications.

I first focus on the sample size. The functions don’t include the data size in their outputs, so I separately calculated the number of rows for each observation to obtain a normalized sample size.
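One plausible reading of "normalized sample size" is the sample size divided by the number of rows available for that lab test. That definition is my assumption for this sketch, as are the column name `sample_ratio`, the row counts, and the test codes.

```r
# Hypothetical row counts per lab test (placeholders, not real data)
n_rows <- data.frame(
  test   = c("test_A", "test_B"),
  n_rows = c(20000, 100000),
  stringsAsFactors = FALSE
)

# Toy simulation results that lack the data size
results <- data.frame(
  test        = c("test_A", "test_A", "test_B"),
  sample_size = c(500, 10000, 500),
  stringsAsFactors = FALSE
)

# Attach the row counts, then express each sample size as a share of the data
results <- merge(results, n_rows, by = "test")
results$sample_ratio <- results$sample_size / results$n_rows
print(results)
```

With this definition, a sample of 500 is 2.5% of a 20,000-row test but only 0.5% of a 100,000-row test, which is why the raw sample size alone can be misleading across tests.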

Now let’s visualize some of the results to see how sensitivity varies by sample size.

Now let’s visualize some of the results to see how specificity varies by sample size.

Now let’s visualize some of the results to see how accuracy varies by sample size.
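Plots of this kind can be drawn with base R boxplots, one panel per metric. The sketch below uses random numbers in place of the real simulation results, so the shapes it produces mean nothing; the data frame `sim` and its columns are hypothetical.

```r
set.seed(1)

# Random stand-in for the simulation results (NOT real data):
# 20 runs for each of the three sample sizes
sim <- data.frame(
  sample_size = factor(rep(c(500, 1000, 10000), each = 20)),
  sensitivity = runif(60),
  specificity = runif(60),
  accuracy    = runif(60)
)

# One boxplot per metric, grouped by sample size
op <- par(mfrow = c(1, 3))
for (metric in c("sensitivity", "specificity", "accuracy")) {
  boxplot(sim[[metric]] ~ sim$sample_size,
          xlab = "sample size", ylab = metric, main = metric)
}
par(op)

# Numeric companion to the plots: standard deviation per sample size
spread <- aggregate(cbind(sensitivity, specificity, accuracy) ~ sample_size,
                    data = sim, FUN = sd)
print(spread)
```

A narrower box (or a smaller standard deviation) at a given sample size would indicate more stable performance across runs.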